During the second week of each unit, we’ll “walk through” a basic research workflow, or data analysis process, modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):
Figure 2.2 Steps of Data-Intensive Research Workflow
Each walkthrough will focus on a basic analysis guided by the social network perspective.
This week, our focus will be on preparing relational data for analysis, looking at some basic network stats, and creating a network visualization that helps illustrate key findings. Specifically, the Unit 1 Walkthrough will cover the following workflow topics:
Prepare: Prior to analysis, we’ll take a look at the context from which our data is derived so you can formulate useful and answerable questions. You’ll also need to set up a “Project” for our Unit 1 walkthrough.
Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In section 2 we focus on reading, reducing, and tidying our data.
Explore: In section 3, we use simple summary statistics, more sophisticated approaches like term frequency-inverse document frequency (tf-idf), and basic data visualization to explore our data and see what insight it provides in response to our question.
Model: While we won’t investigate approaches to Model our data until Unit 3 when we learn about community detection algorithms and exponential random graph models (ERGM), we will see how modeling has been applied.
Communicate: Finally, we’ll wrap up by summarizing key findings from our analysis and considering how they might be shared with a broader audience.
Prior to analysis, it’s critical to understand the context and data sources available so you can formulate useful questions that can be feasibly addressed by your data. For this section, we’ll focus on the following topics:
In Social Network Analysis and Education: Theory, Methods & Applications, Carolan (2013) notes that:
the social network perspective is one concerned with the structure of relations and the implication this structure has on individual or group behavior and attitudes
More specifically, Carolan cites the following four features used by Freeman (2004) to define the social network perspective:
Social network analysis is motivated by a relational intuition based on ties connecting social actors.
It is firmly grounded in systematic empirical data.
It makes use of graphic imagery to represent actors and their relations with one another.
It relies on mathematical and/or computational models to succinctly represent the complexity of social life.
For Unit 1, our walkthrough will be guided by previous research and evaluation work conducted by the Friday Institute for Educational Innovation as part of the Massively Open Online Courses for Educators (MOOC-Ed) initiative.
A Social Network Perspective on Peer Supported Learning in MOOC-Eds was framed by three primary research questions related to peer supported learning:
What are the patterns of peer interaction and the structure of peer networks that emerge over the course of a MOOC-Ed?
To what extent do participant and network attributes (e.g., homophily, reciprocity, transitivity) account for the structure of these networks?
To what extent do these networks result in the co-construction of new knowledge?
For our very first walkthrough, we are going to focus exclusively on RQ1 from the original study, narrowing it to a basic question about our educator network:
How, and to what extent, did educators engage with other participants in the discussion forums?
Based on our course readings and your self-selected readings, what subquestions, or more specific research questions, might you ask to help answer the broader question we’ll focus on for this walkthrough?
In the space below, type a brief response to the following questions:
-
As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR):
Packages are shareable collections of R code that can contain functions, data, and/or documentation. Packages increase the functionality of R by providing access to additional functions to suit a variety of needs.
Let’s check to see which packages have already been loaded into our RStudio Cloud workspace. Take a look at the Files, Plots, & Packages Pane in the lower right-hand corner of RStudio Cloud to make sure these packages have been installed and loaded:
You should see some familiar tidyverse packages from our Getting Started Walkthrough like {dplyr} and {readr}, which we’ll be using again shortly. You should also see an important package called {igraph} that we’ll rely on heavily for our network analyses.
If you are working in RStudio Desktop, or notice that the packages have not been installed and/or loaded, run the following install.packages() function code to install the {tidyverse} and {igraph} packages:
install.packages("tidyverse")
install.packages("igraph")
And library() function to load them:
library(tidyverse)
library(igraph)
At the end of this week, I’ll ask you to share your R script as evidence that you have completed the walkthrough. Although I highly recommend that you manually type the code shared throughout this walkthrough, for large blocks of code it may be easier to copy and paste.
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from the raw data to a dataset that can be explored and modeled (Krumm et al, 2018).
For our data wrangling this week, we’re keeping it simple since working with network data is a bit of a departure from our working with rectangular data frames. Our primary goals for Unit 1 are learning how to:
Import Data. Before working with data, we need to “read” it into R. Once imported, we’ll take a look at different ways to view our data in R.
Create a Network Object. Next, we’ll combine our ties and actors into a single network object using the {igraph} package.
Simplify Network. Finally, we’ll learn about the handy simplify() function in the {igraph} package for removing self-loops and duplicate ties.
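As a quick preview of that last step, here is a toy example using hypothetical actors (A, B, and C; these names and the tiny edge-list are invented purely for illustration) to show what simplify() does to duplicate ties and self-loops:

```r
library(igraph)

# Hypothetical edge-list with a duplicate tie (A->B twice) and a self-loop (C->C)
toy_ties <- data.frame(sender   = c("A", "A", "B", "C"),
                       receiver = c("B", "B", "A", "C"))

toy_network <- graph_from_data_frame(toy_ties, directed = TRUE)
ecount(toy_network)  # 4 edges before simplifying

toy_simple <- simplify(toy_network,
                       remove.multiple = TRUE,  # collapse duplicate ties
                       remove.loops = TRUE)     # drop self-loops like C->C
ecount(toy_simple)   # 2 edges remain: A->B and B->A
```

We’ll apply the same function to our real network later in this walkthrough.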
To get started, we need to import, or “read”, our data into R. The function used to import your data will depend on the file format of the data you are trying to import, but R is pretty adept at working with many files types.
Take a look in the data folder in your Files pane. You should see the following .csv files:
dlt1-edgelist.csv
dlt1-nodes.csv
As its name implies, the first file dlt1-edgelist.csv is an edge-list that contains information about each tie, or relation between two actors in a network. In this context, a “tie” is a reply by one participant in the discussion forum to the post of another participant - or in some cases to their own post! These ties between a single actor are called “self-loops” and as we’ll see later in this section, igraph has a special function to remove these self loops from a sociogram, or network visualization.
The edge-list format is slightly different from other formats you have likely worked with before, in that the values in the first two columns of each row represent a dyad, or tie between two nodes in a network. An edge-list can also contain other information regarding the strength or duration of the relationship, sometimes called “weight,” in addition to other “edge attributes.”
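To make the format concrete, here is a small made-up edge-list (these rows and the weight values are invented for illustration and are not from our DLT dataset), in which each row records one directed tie:

```r
library(tibble)

# Hypothetical edge-list: each row is one directed tie from a sender to a receiver
example_edgelist <- tribble(
  ~sender, ~receiver, ~weight,
  "1",     "2",       3,      # actor 1 replied to actor 2 three times
  "2",     "1",       1,
  "3",     "1",       2
)
example_edgelist
```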
In addition to the ties themselves, our edge-list contains the following variables:
Sender = Unique identifier of author of comment
Receiver = Unique identifier of identified recipient of comment
Timestamp = Time comment was posted
Parent = Primary category or topic of thread
Category = Subcategory or subtopic of thread
Thread_id = Unique identifier of a thread
Comment_id = Unique identifier of a comment
Let’s use the read_csv() function from the {readr} package introduced in the Getting Started walkthrough to read in our edge-list and print the new ties data frame:
ties <- read_csv("data/dlt1-edgelist.csv",
col_types = cols(Sender = col_character(),
Receiver = col_character(),
`Category Text` = col_skip(),
`Comment ID` = col_character(),
`Discussion ID` = col_character()))
ties
Note the addition of the col_types = argument for changing the column types to character strings, since the numbers in those particular columns identify actors (Sender and Receiver) and attributes (Comment ID and Discussion ID). We also skipped the Category Text column.
RStudio Tip: Importing data and dealing with data types can be a bit tricky, especially for beginners. Fortunately, RStudio has an “Import Dataset” feature in the Environment Pane that can help you use the {readr} package and associated functions to greatly facilitate this process.
Consider the example pictured below of a discussion thread from the Planning for the Digital Learning Transition in K-12 Schools (DLT 1) course where our data originated. This thread was initiated by participant I, so the comments by J and N are considered to be directed at I. The comment by B, however, is a direct response to the comment by N, as signaled by the use of the quote feature as well as the explicit mention of N’s name within B’s comment.
Now answer the following questions as they relate to the DLT 1 edge-list we just read into R.
Which actors in this thread are the Sender and the Receiver? Which actor is both?
How many dyads are in this thread? Which pairs of actors are dyads?
Sidebar: Unfortunately, these types of nuances in discussion forum data as illustrated by this simple example are rarely captured through automated approaches to constructing networks. Fortunately, the dataset you are working with was carefully reviewed to try and capture more accurately the intended recipients of each reply.
The second file we’ll be using, to help understand our network and the actors involved, contains all the nodes or actors (i.e., participants who posted to the discussion forum) as well as some of their attributes, such as gender and years of experience in education.
Carolan (2013) notes that most social network analyses include variables that describe attributes of actors, ones that are either categorical (e.g., sex, race, etc.) or continuous (e.g., test scores, number of times absent, etc.) in nature. These attributes can be incorporated into a network graph or model, making it much more informative, and can aid in testing or generating hypotheses.
These attribute variables are typically included in a rectangular array, or data frame, that mimics the actor-by-attribute structure that is the dominant convention in social science: rows represent cases, columns represent variables, and cells consist of values on those variables.
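For instance, an actor-by-attribute data frame might look like this hypothetical snippet (the actors, column names, and values here are invented for illustration):

```r
library(tibble)

# Hypothetical actor-by-attribute data: rows are actors, columns are attributes
example_actors <- tribble(
  ~uid, ~gender, ~experience,
  "1",  "F",     12,
  "2",  "M",     5,
  "3",  "F",     20
)
example_actors
```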
As an aside, Carolan also refers to this historical preference by researchers for “actor-by-attribute” data, in the absence of relational data in which the actor has been removed from their social context, as the “sociological meatgrinder” in action. Specifically, this historical approach assumes that the actor does not interact with anyone else in the study and that outcomes are solely dependent on the characteristics of the individual.
Regardless, let’s read in our node attribute file and take a look at the actors and their attributes included in our dataset:
actors <- read_csv("data/dlt1-nodes.csv",
col_types = cols(UID = col_character(),
Facilitator = col_character(),
expert = col_character(),
connect = col_character()))
Use the code chunk below and a function of your choosing to take a look at the actors data frame:
Match up the attributes included in the node file with the following codebook descriptors. The first one has been done as an example.
Facilitator = Identification of course facilitator (1 = instructor)

Now let’s create our network object by combining the ties and actors data frames with the graph_from_data_frame() function from {igraph}, and then print the result:

network <- graph_from_data_frame(d = ties,
                                 vertices = actors,
                                 directed = TRUE)

network
IGRAPH d47d41c DN-- 445 2529 --
+ attr: name (v/c), Facilitator (v/c), role1 (v/c), experience (v/n), experience2
| (v/c), grades (v/c), location (v/c), region (v/c), country (v/c), group (v/c),
| gender (v/c), expert (v/c), connect (v/c), Timestamp (e/c), Discussion Title
| (e/c), Discussion Category (e/c), Parent Category (e/c), Discussion Identifier
| (e/c), Comment ID (e/c), Discussion ID (e/c)
+ edges from d47d41c (vertex names):
[1] 360->444 356->444 356->444 344->444 392->444 219->444 318->444 4 ->444 355->356 355->444
[11] 4 ->444 310->444 248->444 150->444 19 ->310 216->19 19 ->444 19 ->4 217->310 385->444
[21] 217->444 393->444 217->19 256->219 253->444 301->444 301->444 143->444 218->19 361->217
[31] 30 ->444 30 ->444 335->444 166->444 156->219 173->444 223->444 219->19 219->253 261->444
+ ... omitted several edges
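With the network object in hand, a few {igraph} helper functions give us some quick summary statistics. This is a sketch of standard {igraph} calls, not part of the original walkthrough, so your exact output will depend on the data:

```r
library(igraph)

vcount(network)        # number of actors (445 in the summary above)
ecount(network)        # number of ties (2529 in the summary above)
edge_density(network)  # proportion of possible ties actually present

# Top five actors by in-degree, i.e., the most replies received
sort(degree(network, mode = "in"), decreasing = TRUE)[1:5]
```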